## Preliminaries: Install all the required packages.

To install PyTorch, PyTorch Geometric, and OGB (for the OGB datasets and GNN models):
```
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.1+cu110.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.1+cu110.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.7.1+cu110.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.7.1+cu110.html
pip install torch-geometric
pip install ogb
```
Note that it's crucial that the CUDA version of the installed packages matches your local CUDA version.
If you encounter any issues, see the PyTorch Geometric [FAQ](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html#id1).
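
The wheel URLs in the commands above encode both the PyTorch version and the CUDA version. As a minimal sketch of that relationship, the helper below rebuilds the URL from the two version strings. The function name `pyg_wheel_index` is hypothetical, and the URL pattern is simply inferred from the commands in this section.

```python
def pyg_wheel_index(torch_version, cuda_version):
    """Rebuild the wheel index URL used by the pip commands above.

    torch_version: e.g. "1.7.1" (a local suffix such as "+cu110" is dropped).
    cuda_version:  e.g. "11.0", or None / "" for a CPU-only build.
    """
    base = torch_version.split("+")[0]
    if cuda_version:
        # "11.0" -> "cu110"
        tag = "cu" + cuda_version.replace(".", "")
    else:
        tag = "cpu"
    return f"https://pytorch-geometric.com/whl/torch-{base}+{tag}.html"


# prints https://pytorch-geometric.com/whl/torch-1.7.1+cu110.html
print(pyg_wheel_index("1.7.1", "11.0"))
```

With PyTorch installed, `torch.__version__` and `torch.version.cuda` report the values to pass in. If the resulting URL differs from the one you used with `pip install -f`, the extension packages were probably built for a different CUDA version.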


To install PECOS:
```
pip install libpecos==0.1.0
```

See also **requirement.txt** for the versions of other standard packages such as NumPy.

## Step 0: Get Raw_text.txt.

Our method assumes that you have a raw text file (Raw_text.txt). The i-th line should be the raw text for node i, matching the node_id in the OGB dataset.
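
Because later steps rely on this line-to-node alignment, it may be worth checking the line count before going further. The sketch below is not part of the repository; `count_text_rows` and `check_alignment` are hypothetical helpers, and the demo uses a throwaway three-line file in place of a real Raw_text.txt.

```python
import os
import tempfile


def count_text_rows(path):
    """Count the lines in a raw text file (one line per node)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_alignment(path, num_nodes):
    """Raise ValueError unless the file has exactly one line per node."""
    rows = count_text_rows(path)
    if rows != num_nodes:
        raise ValueError(f"{path}: found {rows} lines, expected {num_nodes}")
    return rows


# Demo on a temporary file standing in for Raw_text.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "Raw_text.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("text of node 0\ntext of node 1\ntext of node 2\n")
    print(check_alignment(demo_path, num_nodes=3))  # prints 3
```

For a real run, `num_nodes` would come from the OGB dataset itself. A mismatch usually means that some node's text contains an embedded newline or that a node is missing.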

For ogbn-arxiv, we use the title and abstract as raw text, both of which are provided with the ogbn-arxiv dataset. The folder `./dataset/ogbn_arxiv/mapping` contains the node_id to paper_id index, and the raw text can be downloaded from [OGB](https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv).

For ogbn-products, we use the product description as raw text. We extract the product description using the provided ASIN (in the folder `./dataset/ogbn_products/mapping`). The mapping from ASIN to description can be found in the Amazon3M dataset from [1].

We also applied simple text cleaning to the raw text using `clean_text.py`. Use it as follows:
```
python clean_text.py {Enter path to raw text} {Enter path to saved clean text}
```

Reference:

>[1] K. Bhatia, K. Dahiya, H. Jain, A. Mittal, Y. Prabhu, and M. Varma. The extreme classification repository: Multi-label datasets and code, 2016.


## Step 1: Put your Raw_text.txt in the correct folder.
For ogbn-arxiv dataset, put Raw_text.txt in the folder `./data_for_XRTransformer/ogbn-arxiv`

For ogbn-products dataset, put Raw_text.txt in the folder `./data_for_XRTransformer/ogbn-products`

## Step 2: Generate TF-IDF features.
First run the following script to build the TF-IDF model.

```
python -m pecos.utils.featurization.text.preprocess build \
  --text-pos 0 \
  --input-text-path {Enter the path to raw text, e.g. ./data_for_XRTransformer/ogbn-arxiv/Raw_text.txt} \
  --vectorizer-config-path ./TFIDF/config.json \
  --output-model-folder {Enter the dir_path to save your model, e.g. ./TFIDF/ogbn-arxiv/tfidf-model}
```

Then run the following script to generate the TF-IDF features.

```
python -m pecos.utils.featurization.text.preprocess run \
  --text-pos 0 \
  --input-preprocessor-folder {Enter the dir_path where your model was saved, e.g. ./TFIDF/ogbn-arxiv/tfidf-model} \
  --input-text-path {Enter the path to raw text, e.g. ./data_for_XRTransformer/ogbn-arxiv/Raw_text.txt} \
  --output-inst-path {Enter the path to save the generated tfidf feature, e.g. ./TFIDF/ogbn-arxiv/tfidf_feature.npz}
```
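
For readers unfamiliar with TF-IDF, the sketch below shows the textbook computation on a toy corpus: raw term frequency times `log(N / df)`, followed by L2 normalization of each row. It is an illustration only. The PECOS vectorizer is controlled by `./TFIDF/config.json` and may differ in tokenization, n-grams, and smoothing, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: weight} dict per document (row i <-> document i)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        # L2-normalize so that documents of different lengths are comparable.
        norm = math.sqrt(sum(w * w for w in row.values()))
        if norm > 0:
            row = {t: w / norm for t, w in row.items()}
        rows.append(row)
    return rows


docs = ["graph neural networks", "graph transformers", "text transformers"]
for i, row in enumerate(tfidf(docs)):
    print(i, {t: round(w, 3) for t, w in row.items()})
```

As with Raw_text.txt, row i of the resulting feature matrix corresponds to node i. Tokens that are rare across the corpus (here "neural") receive a larger weight than common ones (here "graph").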

## Step 3: Prepare data for XRTransformer.
Run the following script. All data will be saved to `./data_for_XRTransformer/${dataset}/` by default. The script saves the adjacency matrix as Y (the XMC label matrix). Degree filtering is also applied during the training phase (by default, nodes with degree greater than 1000 are filtered out).

```
python Prepare_data_for_XRTransformer.py \
	--dataset {Enter your dataset, e.g. ogbn-arxiv} \
	--raw_text_path {Enter your path to raw text, e.g. ./data_for_XRTransformer/ogbn-arxiv/Raw_text.txt} \
	--tfidf_path {Enter your path to tfidf feature, e.g. ./TFIDF/ogbn-arxiv/tfidf_feature.npz}
```
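
To make the two ideas in this step concrete, the sketch below builds neighbor sets from an edge list (row i of Y lists the neighbors of node i, which serve as its labels) and then selects the nodes whose degree does not exceed a threshold. It is a simplified illustration, not the code in `Prepare_data_for_XRTransformer.py`. Treating the graph as undirected is an assumption, and the helper names are hypothetical.

```python
def build_adjacency(edges, num_nodes):
    """Return a list of neighbor sets, one per node (undirected; an assumption)."""
    adj = [set() for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def nodes_within_degree(adj, max_degree=1000):
    """Return the ids of nodes whose degree is at most max_degree."""
    return [i for i, nbrs in enumerate(adj) if len(nbrs) <= max_degree]


# Toy graph: node 0 is a hub connected to nodes 1..4, plus one edge 1-2.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
adj = build_adjacency(edges, num_nodes=5)
print(sorted(adj[1]))                        # prints [0, 2]
print(nodes_within_degree(adj, max_degree=3))  # prints [1, 2, 3, 4]
```

In the toy graph the hub node 0 has degree 4 and is removed when the threshold is 3, which mirrors what the default threshold of 1000 does to very high-degree nodes.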

## Step 4: Train XRTransformer.
For ogbn-arxiv, run the shell script: `xrtransformer_SSL_ogbnarxiv.sh`

For ogbn-products, run the shell script: `xrtransformer_SSL_ogbnproducts.sh`

Note that for ogbn-products, some error messages might appear **after** all matchers are trained. This is **fine**, as we only need the encoder in the matchers, not the ranker.

Also remember to grant execute permission to the shell script. For example, use
```
chmod a+x {script name}.sh
```

****

## Step 5: Generate node features from XRTransformer.
Run the following script to generate the node features from the trained XRTransformer.

```
python Generate_node_features_from_XRTransformer.py \
	--dataset {Enter the dataset, e.g. ogbn-arxiv} \
	--max_level 4
```

The generated node features will be saved under the folder `./data_for_XRTransformer/{dataset}/Results/`

In this example, embeddings are generated for level 0 to level 3. Each embedding is saved as `X.XRT.Lv${level}.npy`, where `${level}` ranges from 0 to 3.
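
The naming scheme can be summarized with a small helper that lists the files expected for a given `--max_level`. The function `embedding_paths` is hypothetical and only reproduces the path pattern described in this step.

```python
from pathlib import Path


def embedding_paths(dataset, max_level, root="./data_for_XRTransformer"):
    """List the expected embedding files for levels 0 .. max_level - 1."""
    results_dir = Path(root) / dataset / "Results"
    return [results_dir / f"X.XRT.Lv{level}.npy" for level in range(max_level)]


for path in embedding_paths("ogbn-arxiv", max_level=4):
    print(path)
```

Assuming the files are ordinary `.npy` arrays with one row per node, each can be read with `numpy.load(path)`.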


## Step 6: Run downstream GNNs.
For standard GNNs and MLP, please check `./OGB_baselines` with the corresponding dataset folder.

For state-of-the-art GNNs (as of July 1, 2021):

- For ogbn-arxiv, check `DeepGCN` folder.
	
- For ogbn-products, check `SAGN` folder.

According to our experiments:

- For ogbn-arxiv, the embedding from level 3 (`X.XRT.Lv3.npy`) gives the best performance.

- For ogbn-products, the embedding from level 1 (`X.XRT.Lv1.npy`) gives the best performance.
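
When scripting the downstream runs, these recommendations can be kept in a small lookup table. The names `BEST_LEVEL` and `best_embedding_file` are hypothetical. The levels are the ones reported above, and the path follows the results folder from Step 5.

```python
# Best-performing embedding level per dataset, as reported above.
BEST_LEVEL = {"ogbn-arxiv": 3, "ogbn-products": 1}


def best_embedding_file(dataset):
    """Return the path of the recommended embedding for a dataset."""
    level = BEST_LEVEL[dataset]
    return f"./data_for_XRTransformer/{dataset}/Results/X.XRT.Lv{level}.npy"


print(best_embedding_file("ogbn-products"))
```

Other levels remain available in the same folder if you want to compare them on your own setup.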

***

## To generate the node features with our tested baselines:

### Vanilla Bert
Run the following script. Note that it requires the tokenized text .pt file in `./data_for_XRTransformer/{dataset}/`, so we recommend running our XRTransformer once first. The resulting node features will be saved at `./data_for_XRTransformer/{dataset}/Results/VanillaBert.npy`

```
python Generate_node_features_from_Bert.py \
	--dataset {Enter the dataset, e.g. ogbn-arxiv}
```

### Bert+SSL Link prediction
Run the following script. The resulting node features will be saved at `{model_dir}/Bert_SSL_LinkPred.npy`

- For ogbn-arxiv:

```
python BERT_SSL_LinkPred.py \
	--model_dir {Enter the dir_path to save the trained model.} \
	--text_path ./data_for_XRTransformer/ogbn-arxiv/Raw_text.txt \
	--dataset ogbn-arxiv \
	--save_steps 10000 \
	--warmup_steps 1000 \
	--max_steps 10000 
```

- For ogbn-products:

```
python BERT_SSL_LinkPred.py \
	--model_dir {Enter the dir_path to save the trained model.} \
	--text_path ./data_for_XRTransformer/ogbn-products/Raw_text.txt \
	--dataset ogbn-products
```

### Other baselines
For the other methods that follow the standard baseline, download the code from the corresponding repository and change the input features in the same way as for the downstream GNNs.

- GraphZoom: https://github.com/cornell-zhang/GraphZoom

- GAE, VGAE, and DGI are already included in the PyG library. We provide the executable code in `./SSL_baselines`. Please check the README file there.



 
